(N 

o 

(N 
•*-> 
O 

O 

m 

Em 



> 

(N 
(N 

d 



X 

c3 



Data Exploration, Quality Control and Testing in Single-Cell qPCR-Based 

Gene Expression Experiments 

Andrew McDavid 1 ' 2 , Greg Finak 2 , Pratip K. Chattopadyay 3 , Maria Dominguez 3 , 
Laurie Lamoreaux 4 , Steven S. Ma 4 , Mario Roederer 3 and Raphael Gottardo 1,2 * 

1 Department of Statistics, University of Washington, Seattle, WA 
2 Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA 
3 Immuno Technology Section, Vaccine Research Center, NIAID, NIH, Bethesda, MD 
4 Immunology Laboratory, Vaccine Research Center, NIAID, NIH, Bethesda, MD 

October 5, 2012 



Abstract 

Motivation: Cell populations are never truly homoge- 
neous; individual cells exist in biochemical states that define 
functional differences between them. New technology based 
on microfluidic arrays combined with multiplexed quantita- 
tive polymerase chain reactions (qPCR) now enables high- 
throughput single-cell gene expression measurement, allow- 
ing assessment of cellular heterogeneity. However very little 
analytic tools have been developed specifically for the sta- 
tistical and analytical challenges of single-cell qPCR data. 
Results: We present a statistical framework for the ex- 
ploration, quality control, and analysis of single-cell gene 
expression data from microfluidic arrays. We assess accu- 
racy and within-sample heterogeneity of single-cell expres- 
sion and develop quality control criteria to filter unreliable 
cell measurements. We propose a statistical model account- 
ing for the fact that genes at the single-cell level can be on 
(and for which a continuous expression measure is recorded) 
or dichotomously off (and the recorded expression is zero). 
Based on this model, we derive a combined likelihood-ratio 
test for differential expression that incorporates both the 
discrete and continuous components. Using an experiment 
that examines treatment-specific changes in expression, we 
show that this combined test is more powerful than either 
the continuous or dichotomous component in isolation, or a 
t-test on the zero-inflated data. While developed for mea- 
surements from a specific platform (Fluidigm), these tools 
are generalizable to other multi-parametric measures over 
large numbers of events. 

Availability: All results presented here were obtained us- 
ing the SingleCellAssay R package available on GitHub 
( jhttp : / / github . c om/RGLab/SingleCell Assay[ ) . 
Contact: rgottard@fh crc.orgl 

Supplementary Material: Supplementary data are avail- 
able. 



1 Introduction 

The development fluorescence-based flow cytometry 
(FCM) revolutionized single-cell analysis. Although even 
nominally-homogeneous populations of cells sorted by flow 
cytometry using established surface markers may appear 
monolithic, mRNA expression of specific genes within 



these cells can be heterogeneous (Dalerba et al. 2011) 



and could further discriminate cell subsets. On the other 
hand, classical gene expression experiments (micro-arrays, 
RNA-seq, qPCR) richly characterize a cellular population, 
but at the cost of reporting a summation of expression 
from many individual cells. Recent advances in microfluidic 
technology now permit performing thousands of PCRs in a 
single device, enabling rich gene expression measurements 
at the single-cell level across hundreds of cells and genes 
(Kalisky and Quake 2011). This provides a technology 
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that probes the stochastic nature of biochemical processes, 
resulting in relatively large cell-to-cell expression variability. 

This heterogeneity may carry important information: 
thus single cell expression data should not be analyzed in 
the same fashion as population-level data. At the scale of 
a single cell, biological variability (the object of interest) 
and technical variability (a nuisance factor) are often of the 
same magnitude, making it difficult to distinguish between 
the two. The dichotomous nature of single-cell data further 
complicates matters: measurements on individual cells may 
be dichotomously absent due to real biological effects, so 
are not always present on a continuum. These features of 
single-cell data require special attention during analysis. 

Here we focus on the reverse-transcriptase qPCR (rt- 
qPCR) -based Fluidigm (San Francisco, CA) single-cell 
gene expression assay, which provides simultaneous mea- 
surements of up to 96 genes on mRNA sources as minute as 
a single cell. In traditional RT-qPCR , despite careful mea- 
surement of starting concentrations of cDNA, correction for 
differences in quantities of starting material below the limit 
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of detection is necessary for reliable results ( Vandesompele 2 Methods 



et al. 2002). Subtraction of internal control genes, or aver- 



ages thereof is typically used (e.g., the A-Ct method), and 
results are often reported in numbers of copies, or fold in- 
crease per cell (ISchmittgen and Livakl 120081) . In array-based 



gene expression, differences in hybridization and washing of 
non-specific DNA between chips require additional correc- 
tion. 

Such normalization schemes are not directly applicable 
in single-cell gene expression experiments, nor is it obvious 
that they are needed. For single cells, the individual cell is 
the atomic unit of normalization and the amount of starting 
material naturally measured in number cells per reaction. 
Even if one attempted direct application of traditional nor- 
malization approaches, the dichotomous nature of single-cell 
expression hinders their use. 

Nonetheless, it is important to test for and address any 
technical biases. We present a filtering approach for remov- 
ing outlying measurements at the single-cell level that ac- 
counts for the dichotomous nature of the data. Using con- 
cordance measures derived from three data sets where gene 
expression was measured at the single-cell and hundred-cell 
levels, we show that classical RT-qPCR type normalization 
is not necessary with single-cell multiplexed PCR data and 
that our filtering step removes technical artifacts that most 
severely impact quantitation. 

A typical goal of gene expression experiments is to search 
for differential expression across groups. The dichotomous 
nature of expression in Fluidigm introduces problems for 
testing differential representation of cell subsets character- 
ized by expression patterns, as well. Traditional tests of 
differential expression such as the t-test or other approaches 
based on normality arc likely inappropriate for zero-inflated 
data dSmythl 120041 iGottardo et all 120061). Approaches 



Powell et al.\ (|2012j) used a 
then 



to this problem have varied 

winsorized z-transformation of the expression values 
treated them as continuous 



Glotzbach et al. (20111 used 



the non-parametric, Kolmorgov-Smirnov test for differences 
in distribution to find differentially expressed genes, after 
winsorizing. Flatz et al. (2011) dichotomized the expression 
and worked with the binary trait. However, as we will see 
later, both the continuous and discrete parts of the measure- 
ments arc informative for differential expression and should 
be used. A parametric test allows directions of difference to 
be assessed. 

Here we propose a discrete/continuous model for single- 
cell expression data based on a mixture of a point mass 
at zero and a log-normal distribution. Using this model, we 
derive a likelihood ratio test that can simultaneously test for 
changes in mean expression (conditional on the gene being 
expressed) and in the percentage of expressed cells. 



2.1 Data sets and notations 

We use three Fluidigm single-cell gene expression data sets 
described below. We offer a brief overview of the assay tech- 
nology used for our data. Desired cells (e.g. antigen-specific 
CD8+T cells) are selected and lysed, and a cDNA library is 
generated through RT-qPCR . A short ( c. 15 cycle), multi- 
plexed pre-amplification selects and enriches for the desired 
genes. These products are loaded onto the Fluidigm chip 
and gene-specific primers are added for single-cell gene ex- 
pression quantification. For the data presented here we used 
a 96 x 96 format plate, i.e. 96 genes across 96 cells. The de- 
sign of the chip generates each combination of the 96 genes 
and 96 enriched cDNA libraries producing 9216 separate 
PCR reactions. After each cycle, the fluorescence is read. 
The cycle (or interpolated fraction thereof) at which the 
fluorescence crosses a pre-determined threshold is recorded, 
defined as the "cf value. For all data sets considered here, 
primers were chosen to have > 90% amplification efficiency 
Data set A: Twenty-eight 96 x 96 format plates of CMV- 
or HIV-spccific CD8+ single cell T cells were isolated from 
16 individuals. The donors' cells were stimulated with 
one of four tetramers. Cells were sorted immediately af- 
ter tetramer incubation ("unstimulated") or after 3 hours 
of exposure ("stimulated"). Approximately 90 individual 
cells were measured for each patient-stimulation combina- 
tion ("unit"). 

Data set B: Ten subjects were considered, and approxi- 
mately 180 cells were sorted per subject, with each subject 
crossed between two arrays. 

Data set C: Two subjects were considered. Fluorescent 
staining of CD4+ T cells allowed cytometric sorting into 
CD154+/- sub-populations. Approximately 40 cells were 
sorted per sub-population per subject across three arrays. 

Additionally, for each individual and treatment within 
each data set, aggregates of 100 cells (i.e., 100 cells per 
well on the array) were isolated and assayed by Fluidigm 
technology. The expression measured in these 100-cell ag- 
gregates, after dividing by 100, provides a "biological" av- 
erage of expression per-cell, and can be compared to an in 
silico average of the single-cell measurements. The concor- 
dance between these two averages serves as a measure of 
experimental fidelity (|Lin| |1989|) 



Notations: The stan- 
dard assumptions of qPCR-based assays apply to the Flu- 
idigm technology, namely that the cycle threshold (ct) is 
inversely proportional to the log of fluorescence. The fluo- 
resence is directly proportional to the starting concentration 
of mRNA ( |Higuchi et U\ |1992| |Karlen et aTj |2007[ ). The 
Fluidigm instrument returns the cycle threshold (ct), how- 
ever, we find it more useful to work with the complement of 
ct, which we define as the expression threshold (et) 



et = c n 



ct 



where c max is the maximum number of cycles used, 40 in 
our case. Assuming all reactions are in the exponential 
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amplification phase, this quantity should be directly pro- 
portional to the log-abundance of mRNA, plus an intercept 
term corresponding to the number of cycles it takes for the 
minimally-detectable quantity of mRNA to cross threshold. 
If the fluorescence does not cross the threshold after 40 cy- 
cles, then the Fluidigm instrument records a value of N/A, 
and we say that the gene is not detected. As we will see 
in the results section, detected genes typically have a value 
of ct much less than c max suggesting that undetected genes 
might be regarded as unexpressed genes. This assumption 
is supported by the idea that transcription of mRNA is 



thought to occur in bursts of activity ( Levsky et al. 2002 



Kaufmann and van Oudenaarden 2007 1 , followed by quies- 



cence. Other authors have noted this feature in single cell 
expression studies as well (Glotzbach et al. 2011). When 



looking at the concordance of the single-cell and hundred- 
cell experiments, this assumption is reasonable and leads 
to better concordance than omitting the N/A values. As a 
consequence, we treat the undetected genes as unexpressed 
genes, and we set the corresponding et value to — oo, so 
that the mRNA abundance is zero (i.e. 2 et = 0). For a 
fixed sample or experimental unit, let us denote by etij the 
expression threshold of well i and gene j , for i — 1 , . . . , I 
and j = 1, . . . , J. This results in a matrix of log 2 based 
expression values, ET = (etij), just as in array-based gene 
expression. Similarly, we will denote by Y = (ytj) the ma- 
trix of untransformed expression values where = 2 etij . 
Usually a well contains one cell but the Fluidigm technol- 
ogy can be used with multiple cells per well to quantify the 
gene expression of a mixture of cells. As a consequence, we 
prefer to use the term "well" instead of "cell" . In the three 
data sets used here, wells will contain either one or hundred 
cells. Finally, several biological units are typically measured 
in an experiment, and in this case we will use an extra index 
k to refer to the biological units. 



Figure 1: Histogram and theoretical (normal) distribution 
of (etij\vij = 1) for single cell (left, light gray) and hundred 
cell experiments (right, dark gray). Genes FASLG, IFN-7, 
BIRC3 and CD69 are depicted. The frequency expression of 
each gene in the single cell experiments is printed above 
each histogram. The mean of the hundred cell and single 
cell experiments is indicated by a thick black line along the 
x-axis. 



2.2 A model for single cell expression 

As described previously, for a given cell, a gene can be de- 
fined as on (i.e. a positive et value is recorded) or as off 
(i.e. the gene is undetected and y™ =0). To simplify our 
model, we will denote by Ujj = 1 [y^- = 0] the indicator vari- 
able equal to one if the gene j is expressed in well i and zero 
otherwise. Following classical statistical conventions, we use 
upper cases to denote the random variables, and lower cases 
to denote the values taken by these random variables. Using 
these notations, we introduce the following model of single- 
cell expression 

(Y y |Vy = l) ~ logNormalG^crf) (1) 
(y y |Vy=0) ~ S (2) 
Vu ~ Befa) (3) 

where So denotes a point mass at zero, (ij and cr? are the 
log 2 -based mean and variance expression level parameters 
conditional on the gene being expressed (i.e. Vij — 1), and 
TTj is the frequency of expression of gene j across all cells. 



3 



In the data sets considered here, the frequency of expression 
greatly varies across genes from to .99 with a median value 
of TTj around .1 (see Supplementary Figure 1). Note that as- 
suming a log- Normal model for (Yij\Vij = 1) is equivalent to 
modeling (ET^lVy = 1) as normally distributed. The em- 
pirical distribution of the data (Figure [T] and Supplementary 
Figures 4-6) motivates our selection of a log-normal distribu- 
tion and follows observations of previous authors ( Bengtsson 



et al. 2005 1 



Thus in a particular gene, three parameters characterize 
the expression distribution: fij , aj , the mean and standard 
deviation of the etij\Vij = 1, and Ttj, the Bernoulli proba- 
bility of expression. 

2.3 Quality Control and Filtering 

The Fluidigm assay is sensitive, and due to the exponential 
amplification of starting mRNA, even minute contamination 
can render a measurement unreliable. Similarly, variation in 
cell preparation can have significant impact on the result- 
ing experiment and data, such as unintentional empty wells, 
which would distort estimates of nj. This suggests testing 
for, and possibly removing outliers before conducting fur- 
ther analysis. We examine both the discrete component Vij 
and the continuous component (etyluy — 1) in screening 
for outliers. We define the robust z-transformed positive 
expression value as 



etij 



mediani(eiij) 



k ■ MADi(etij) 



where the median and median absolute deviation are calcu- 



lated, for a given gene, over expressed cells (i.e. 



1). 



and k — 1.48 is a scaling constant that gives the standard 
deviation in terms of the MAD for the normal distribution. 
Next, let fi = asiny^ be the Bernoulli variance-stabilizing 
transformation of the proportion of genes expressed in well 
i. Then we define the robust z-transformed fraction as 



fi - medianj(/i) 
k-MKDiifi) 



where the median, MAD and k are as defined previously. 
This leads to the following steps for filtering: 

1. Remove null cells with no detected genes, i.e. Vij = 0, 
for all j. 

2. Pick threshold for z filtering (t z ); threshold for ( filter- 
ing (t c ). 

3. Calculate Zij and £i 

4. Remove wells in which genes have \z\ > t z OR \Q > t^. 

Step 1 removes wells where no cells were loaded, and thus all 
measured expression values are null. It is important to per- 
form this step first to prevent break-down in the median and 
MAD estimates for the £'s in experiments with many ampli- 
fication or flow cytometry failures. Finally, step 4 removes 



unreliable wells that either have an extreme proportion of 
expression or extreme cell x gene expression values. In prac- 
tice, we find that picking t z = 9, = 9 works well for the 
data sets we consider here, see section [3] (Results). 

2.4 Testing for ET differences between ex- 
perimental groups 

One typical goal of gene expression analysis is to test for dif- 
ference in expression patterns between experimental units. 
Here, we focus on testing differential gene expression be- 
tween two paired-biological units, e.g. before and after stim- 
ulation. Our framework should be generalizable to other 
types of situations, see Discussion section. The classical 
test for changes in mean for samples with continuous mea- 
surements is the i-test. Conversely, if only a change in ir 
were of interest, then a contingency table test (Chi-square, 
Fisher's Exact or Bernoulli likelihood ratio) is appropriate. 
However, in our case, we would like to test for a change 
in (i and it simultaneously, since both could be biologically 
relevant. Formally, we wish to test 

H : 7T = 7ri and = Ml 

versus the alternative 

H a : ttq ^ 7Ti and /i ^ fxi. 

This can be accomplished using a likelihood ratio test that 
would simultaneously test for differences in means or pro- 
portions of expression. 

Suppose that / wells are assayed in each unit, though the 
unbalanced case (Iq 7^ Ii) would be treated similarly with 
obvious changes of notation. Based on 0, the likelihood 
function for one gene across two biological units, omitting 
the gene index j for clarity, is given by 

l^v) =n^*( i -' r *) 7 " n * n 9{vik\^y) (4) 

where y and v are the vectors of observations for the gene 
across the two groups, = {^ k , c 2 , -k^-. k = 0, 1} is the vector 
of unknown parameters, Sk is the set of cells expressing the 
gene in group k (i.e. S k = {i : v ik = 1}) , n k = ^ v ik is the 
number of cells expressing the gene in group fc, and g is the 
density function of the log-normal distribution with param- 
eters /ife and er?. The likelihood ratio test (LRT) statistic 
A(y, v) is then defined as the ratio of the null and alterna- 
tive likelihoods obtained by replacing the unknown parame- 
ters with their null and alternative maximum likelihood es- 
timates. Detailed derivations of the likelihood function and 
the LRT statistics are described in supplementary material. 

An interesting observation is that the likelihood function 
given by Q is log-linear in the parameters tt and (/i,ct 2 ) 
since it is the product of the Bernoulli likelihood for all cells 
and the log-normal likelihood for the expressed cells. It fol- 
lows that the log-LRT statistic decomposes as a sum of a 
Bernoulli log-LRT test statistic and a lognormal log-LRT 



4 



test statistic, since each component can be maximized inde- 
pendently. It thus combines the two sources of information 
in a natural way. In the results section we will show that 
our combined LRT test is more powerful than the Bernoulli 
or log-normal tests alone. 



3 Results 

3.1 Distributional assumptions 

In Figure [I] we observe good agreement between the empir- 
ical distributions of positive et values and their postulated 
normal distribution for four genes in data set A. This con- 
firms that a log-normal model for the positive expression 
level, yij\vij — 1, is appropriate. Even cells in the low- 
est quantiles of et (and lowest quantiles of expression) still 
have expression far away from the bound at 0, suggesting 
that undetected genes represent cells with null or negligible 
RNA abundance. It is also noteworthy that the difference 
between the means (shown as a heavy, vertical line) of the 
hundred cell replicates and single cell replicates is approx- 



imately log 2 (100 • TTj ) cycles, where tTj X ' is the expression 
frequency of gene j in the single-cell experiments. As such, 
in genes with iri <C 1, such as FASLG, this difference be- 



(i) ic 



tween means is smaller than genes with 7r 



(i) 



1. As we 



will see the next section, inclusion of the unexpressed cells 
(vij = 0) is important to accurately relate the expression 
level of the single-cell experiments to the hundred-cell ex- 
periments. 



Concordance between hundred-cell and 
single-cell experiments 



3.2 



The 100-cell aggregates (see section [2T| data sets and nota- 
tion) allows us to assess the accuracy and reliability of our 
single-cell experiments by comparing this in-vitro 100-cell 
expression to an in-silico estimate obtained by averaging 
the expression of 100 single-cell measurements. The in silico 
average of signal in a gene j and unit k from 100 single-cell 
wells is = 5Zi=i 2/ij'fe/100 where y^ is the expression 
measurement of gene j in cell i and unit k. We compare this 
to the in-vitro "average" of signal from a 100 cell aggregate. 
In this case, we just use the expression of a gene-unit and 
divide by the number of cells (100). 

The concordance here is assessed both visually by plotting 



logoff + 1) 



1 



) (Figure p| and by calculat- 



ing the concordance correlation coefficient (r c ) between the 
two variables, which is often used to quantify reproducibil- 
ity (Lin 1989). The shifted log transformation allows vi- 



sualization of both the discrete and continuous components 
while being on the et scale. 

We first use this concordance experiment to test whether 
wells that do not cross the fluorescence threshold after c max 
should be treated as exact zeroes or missing values. If 
we suppose that = implies an assay failure and the 



measurement should be discarded, we would simply com- 
pute the single cell average over expressed cells, i.e. y^ — 
^2iyij v ij/^2i v ij- Figure [2] demonstrates good concordance 
between the hundred-cell and single-cell experiments when 
the undetected genes are treated as zeros. However, this is 
not the case when the zeros are treated as missing values. 

3.3 Filtering outlying cells 

In addition to the concordance measure r c , we use another 
goodness-of-fit measure to optimize our filtering parameters 
t z , defined by, 



WSS = 5>i* (log 2 (l$ + 1) - log 2 (|/S 00) + IJK 



where 



(5) 

Jfc — J2i v ijk is the number of positive wells for gene 
j in unit k in the single-cell experiments. For a particular 
gene and unit, the WSS decreases as we lower the filtering 
threshold and extreme values are filtered. Eventually, so 
many cells are removed that there is zero expression (and a 
large deviance) for the in-silico estimate. Thus we wish to 
find a set of values for the filtering parameters that would 
lead to the lowest WSS measure across the three data sets 
used here. The addition of the scaling factor Ujk gives higher 
weight to combinations with more ex ante positive observa- 
tions, so that the contribution to the sum of squares would 
be smaller in gene x unit combinations that have fewer ex- 
pressed cells. The factor rijk can also be interpreted as the 
scaling factor for the variance of the mean over positive ob- 
servations. Finally, the WSS is computed on the log 2 (y + 1) 
scale to reduce the effect of extreme outliers. 

When hundred-cell aggregates are available, one can opti- 
mize the filter parameters t z , t^ by minimizing the WSS over 
possible combinations. In our case, we found that setting 
t z = 9, = 9achieves the best reduction in WSS across the 
three data-sets explored here (Supplementary Figure 2, and 
Supplementary Table 1). Using these values, our filtering 
criteria moderately improves the concordance between the 
single-cell and hundred-cell experiments in two of the data 
sets but dramatically improves (decreases) the weighted sum 
of squares. We see that the average per-unit et of multiple 
genes are moved towards the diagonal. 

3.4 Normalization and housekeeping genes 

Other authors have noted that "the gene transcript number 



is ideally standardized to the number of cells" ( |Vandesom- 
pele et al. 2002 1 , which is the case with gene expression from 



sorted cells. So it is not entirely a surprise that we find little 
evidence for housekeeping genes providing useful normaliza- 
tion here. For a housekeeper to have good validity, it should 
have high cross correlation with other housekeeping genes. 
This is not the case for housekeepers GAPDH and POLR2A, 
which in data set A, in linear regression have an R 2 = .027. 
In Supplementary Figure 3, we observe in scatter plots of 
housekeepers' et that the correlation drops even further (to 
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Figure 2: Concordance between hundred cell ?/ lclo )/100 and the in silico average of single cell wells for data sets A, 
B, C. In the top row, wells with Vij = are included and treated as exact zeroes. In the middle row they are excluded, 



resulting in a clear lack of concordance. In the final row, wells are filtered as per section 2.3 Dark, thin lines show 



the initial location of a gene before filtering and connect to the location of the gene after filtering. In each panel, r c , 
the concordance correlation coefficient and WSS, the average weighted squared deviation of expression measurements is 
printed. The dotted black line shows a lowess fit throught he data. In all cases, the expression values are transformed 
using a shifted log-transformation (log 2 (y + 1)). As such a graphed value of zero corresponds to a zero expression value 
(i.e. y = 0). 
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Figure 3: Number of discoveries (genes x units) versus False 
Discovery Rate, by treatment, data set A. The combined 
likelihood ratio test is compared to a Bernoulli or normal- 
theory only likelihood ratio test, as well as a t-test of the raw 
expression values (2 et scale), including zero measurements. 

an R 2 = .017) after filtering outlying cells (see previous sec- 
tion). Since the correlation between housekeepers is present 
primarily in cells we suspect suffered from technical error, we 
find little utility in normalization schemes. In fact, the use 
of housekeeping genes for normalization could even result in 
masking cellular artifacts that should be filtered. 

3.5 An efficient test of differential expres- 
sion for single-cells 



In data set A, approximately 90 cells in each of 16 subjects 
were measured in unstimulated and stimulated states (see 
2.1). This permits conducting a test for each gene in each 

subject for differences in n and fi, as described in section 4. Conclusion 
2.4| We plot the number of discoveries at various false dis- 
covery rates (FDR) in Figure [3] The combined likelihood 
test produces the greatest number of discoveries over a wide 
range of FDR. For example, at an FDR of 1%, our com- 
bined test could detect more than 20 additional gene x unit 
changes across the four stimulations. 

Another feature of the combined LRT is its robustness to 



Figure 4: — log 10 P of tests (genes x units) versus frequen- 
cies of expression it of the genes. The Bernoulli, normal- 
theory and combined likelihood ratio tests are plotted. 

shown in Figure [4] in which — log 10 p- values are shown for 
the Bernoulli, normal and combined LRTs versus frequency 

Of 7Tj. 

A total of 65 genes were detected at an FDR of 1% in at 
least one individual. We define p* — — sign(/zi — ^o) 1 log 10 j? 
as the negative log 10 p-value times an indicator variable 
which is positive when stimulated groups have greater ex- 
pression, and negative otherwise. Figure [5] plots a heatmap 
of signed log 10 p-values. The selected genes are in clustered 
rows; individuals are in columns arranged by stimulations. 
The color above each column indicates which antigen stim- 
ulation the individual received. From this, it is clear that 
genes cluster into up-regulated and down-regulated modules 
and that there is much individual variability in response. 



background gene frequency ttj. Of course, if iXj k on aver- 
age, then any test will be unpowered to detect group differ- 
ences. But using only the continuous components amounts 
to "throwing away" data for genes with intermediate 7Tj. 
And similarly, using only the dichotomous component re- 
sults in a test insensitive to differences in in frequently 
expressed genes. This robustness to the TTj spectrum is 



Current approaches for analysis of single-cell assays have in- 
completely utilized the salient features of the experiment, 
and the resulting inference can be sub-optimal. In this pa- 
per, we have presented a framework for data exploration, 
quality control and testing for differential expression us- 
ing single-cell data. Our comparison of 100-cell and single- 
cell measurements shows that undetected genes in an assay 
should be treated as effective "zeroes." Both the discrete, 
zero-inflated portion and continuous portion of single-cell 
expression data are meaningful for detecting outliers. More- 
over, differences in either could be of biological interest, so it 
is desirable to combine evidence from both to detect changes 
in expression. Our likelihood ratio test allows just that. 
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Figure 5: Heatmap of signed log 10 p for selected genes (rows) 
and all individuals (columns). The color above each column 
indicates the stimulus applied to the cells. Red and purple 
are two CMV peptide pools; yellow and orange are two HIV 
peptide pools. 



Although we have suggested default parameters for the 
filtering of outliers, informed from several data sets, our de- 
faults are likely conservative. They are 3-4 times larger than 
the most substantial difference in expression between exper- 
imental groups we observed. Therefore, scientists may wish 
to tune these parameters based on their own relative impor- 
tance of eliminating potential technical error versus missing 
biological heterogeneity. Acquiring forms of ground-truth 
besides "bulk" experiments (in our case, 100-cell aggregates) 
could allow forming tighter bounds. 

Further work, incorporating a mixed-effects model to our 
likelihood ratio test, could extend its applicability. The test 
outlined in this paper may not be appropriate in cases where 
traits of interest are not blocked within individuals (e.g., 
comparing between phenotypes like HIV+ vs. HIV-). In this 
case, one wishes to identify gene expression changes across 
groups, in spite of high individual-to-individual heterogene- 
ity. By modeling the mean and proportion of expression as 
common across groups and adding specific random effects for 
between-individual variability, our model could be extended 
to address such experimental questions as well. 

Single-cell gene expressions assays have already been 
shown to be useful in multiple studies and will become even 
more routine once sequencing at the single-cell level becomes 



practical ( |Varadarajan et al. 2011 Ramskold et al. 2012). 



As a consequence, the development of effective statistical 
methods to analyze such data is becoming increasingly im- 
portant. This paper offers a coherent framework for re- 



searchers using this nascent technology. 
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